Overview

Aircasting is an open-source, end-to-end solution for collecting, displaying, and sharing health and environmental data. The idea is to scrape air quality data, especially PM2.5 levels across Philadelphia, off the Aircasting website and characterize it both graphically and analytically.

Introduction

Asthma is a prototypical complex disease for which studying genetic and environmental factors simultaneously may lead to greater breakthroughs in understanding its pathophysiology than studying genetics or the environment in isolation. However, there have been few attempts to simultaneously and comprehensively address the influence of genetics and the environment on asthma. In this project we take the first important step of characterizing environmental parameters, especially air quality as expressed by PM2.5 levels, that might correlate with asthma rates across the US. A good visualisation of the geographical variation in PM2.5 levels will give us clues towards better understanding respiratory health in the United States.

The project is unique in that it sits at the intersection of Medicine, Computer Science and Art. Creative visualisations often require an artistic approach, which in our case is implemented via the statistical computing software R. However, the visualisations become meaningful only when augmented by solid medical domain knowledge. Yet another unique opportunity presented by Dr. Himes for the project is the ability to collect our own live data using the Aircasting sensors. Once the data is collected, cleaned and visualised for Philadelphia, we will extend the study to other cities and states, and thereafter use Machine Learning algorithms to predict the chance of suffering from asthma for people in a given city.

Methods

The data was obtained as a set of CSV files from the Aircasting website. In its raw form it was rather difficult to run visualisation/modelling routines on it. This was mainly because measurements from different sensors were appended vertically rather than horizontally, resulting in multiple header rows scattered throughout the dataset. The following data snippet illustrates this:

raw_snip <- read.csv('/Users/sauravbose/Data Science/Bioinformatics/Aircasting Data/raw_snip.csv')
raw_snip
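The core of the splitting step can be sketched in a few lines (a simplified, self-contained illustration with toy data, not the production script that follows): rows whose measurement column is non-numeric are treated as header rows, and a cumulative sum over that indicator assigns every row to its sensor block.

```python
import pandas as pd

# Toy stand-in for the raw Aircasting export: two sensor blocks stacked
# vertically, each preceded by its own header row.
raw = pd.DataFrame({0: ["Timestamp", "10:00", "10:01",
                        "Timestamp", "10:00", "10:01"],
                    3: ["Value", "61.0", "62.5",
                        "Value", "4.2", "4.6"]})

# A row is a header if its measurement column is not numeric.
is_header = raw[3].map(lambda x: not x.replace('.', '').isdigit())

# Each header starts a new block; the cumulative sum yields a block id.
raw["block"] = is_header.cumsum()

blocks = {i: g.drop(columns="block") for i, g in raw.groupby("block")}
print(len(blocks))  # 2 sensor blocks recovered
```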

In order to transform the raw data into a more manageable form, Python was used in favour of R due to its superior data processing capabilities. The following code identifies the various sensor measurements contained in the input CSV files and splits them into sensor-specific CSV files. Each input file contains data corresponding to Temperature, Sound Level, Humidity and PM2.5 measurements. The Python code reads in these files and generates four CSV files, one per sensor.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import os

path = r'/Users/sauravbose/Data Science/Bioinformatics/Aircasting Data/sessions_20171113233300/'
files = os.listdir(path)

# Read one raw session file; it has no single header row because the
# sensor streams are stacked vertically, each with its own header rows.
# (The script is re-run per session file; the append mode used below
# accumulates the output.)
df = pd.read_csv(path + files[4], sep=',', header=None)
df = df.dropna().reset_index(drop=True)

# A row is a header row if its measurement column (column 3) is non-numeric.
truth = df[3].map(lambda x: not x.replace('.', '').isdigit())

# Assign each row a block index so that every run of header rows and the
# data rows that follow it share the same index.
counter = 1
counter_temp = 1
idx = []

for i in range(truth.size):
    if truth[i]:
        idx.append(counter)
        if i != truth.size - 1 and not truth[i + 1]:
            counter_temp += 1
    else:
        idx.append(counter)
        if i != truth.size - 1 and truth[i + 1]:
            counter = counter_temp

df.index = idx
gp = df.groupby(df.index)

# Re-assemble each block into its own DataFrame, promoting the block's
# first row (the sensor metadata header) to column names.
data = {}
for i in gp.groups:
    block = np.array(gp.get_group(i))
    data[str(i)] = pd.DataFrame(block[1:], columns=block[0])

keys = list(data.keys())  # list() so the keys can be indexed below
response = []
units = []
for k in keys:
    response.append(data[k]["sensor:capability"][0])
    units.append(data[k]["sensor:units"][0])

# Build column titles such as "Temperature(degrees Fahrenheit)".
output_title = [r + "(" + u + ")" for r, u in zip(response, units)]

for i, k in enumerate(keys):
    data[k].loc[1, "sensor:units"] = output_title[i]

# Promote the measurement header (Timestamp, geo.lat, geo.long, value)
# to column names and drop the two metadata rows.
for k in keys:
    data[k].columns = data[k].iloc[1]
    data[k] = data[k].reindex(data[k].index.drop([0, 1]))

# Append each sensor's rows to its own sensor-specific output CSV.
for i, k in enumerate(keys):
    out_path = r'/Users/sauravbose/Data Science/Bioinformatics/Aircasting Data/Clean_Data/{}.csv'.format(response[i])
    with open(out_path, 'a') as f:
        data[k].to_csv(f, header=False, index=False)

The output files are shown below:

dir('/Users/sauravbose/Data Science/Bioinformatics/Aircasting Data/Clean_Data')
## [1] "Humidity.csv"           "Particulate Matter.csv"
## [3] "Sound Level.csv"        "Temperature.csv"

These sensor-specific CSV files are now read into R and are ready to be analysed.

temp <- read.csv('/Users/sauravbose/Data Science/Bioinformatics/Aircasting Data/Clean_Data/Temperature.csv', stringsAsFactors = F)

hum <- read.csv('/Users/sauravbose/Data Science/Bioinformatics/Aircasting Data/Clean_Data/Humidity.csv', stringsAsFactors = F)

pm <- read.csv('/Users/sauravbose/Data Science/Bioinformatics/Aircasting Data/Clean_Data/Particulate Matter.csv', stringsAsFactors = F)

sound <- read.csv('/Users/sauravbose/Data Science/Bioinformatics/Aircasting Data/Clean_Data/Sound Level.csv', stringsAsFactors = F)
s <- inner_join(temp,sound,by = c("Timestamp","geo.lat","geo.long"))
s <- s[!duplicated(s$Timestamp),]

s2 <- inner_join(s,hum,by = c("Timestamp","geo.lat","geo.long"))
s2 <- s2[!duplicated(s2$Timestamp),]

s3 <- inner_join(s2,pm,by = c("Timestamp","geo.lat","geo.long"))
s3 <- s3[!duplicated(s3$Timestamp),]

dim(s3)
data <- s3 %>% mutate(Timeofday = Timestamp)
data <- data %>% transform(Timeofday = strptime(Timeofday, "%Y-%m-%dT%H:%M:%S"))%>% transform(Timeofday= ifelse(Timeofday$min>30,Timeofday$hour+1,Timeofday$hour))


data <- data %>% rename(Temperature = Temperature.degrees.Fahrenheit., Sound = Sound.Level.decibels., Humidity = Humidity.percent., PM2.5 = Particulate.Matter.micrograms.per.cubic.meter., lon = geo.long, lat = geo.lat)

dim(data)
unique(data$Timeofday)
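The dense Timeofday transform above simply rounds each timestamp to its nearest hour (minutes past 30 round up). For illustration, here is the same rule in Python (a sketch mirroring, not replacing, the R code):

```python
from datetime import datetime

def time_of_day(ts):
    """Round an ISO-8601 timestamp to the nearest hour, as in the R transform."""
    t = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")
    return t.hour + 1 if t.minute > 30 else t.hour

print(time_of_day("2017-11-13T14:45:10"))  # 15
print(time_of_day("2017-11-13T14:20:10"))  # 14
```

Note that, like the R version, a reading at 23:45 maps to hour 24 rather than wrapping to 0.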

Once all of the sensor data is cleaned, organised and loaded, here is what it looks like:

data

We have also imported similar data files for the Humidity, Sound Level and PM2.5 measurements. As we can see from the above data snippet, each sensor measurement has geo-spatial coordinates and a timestamp associated with it. These act as additional degrees of freedom in our analysis.
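Because every reading carries the same three keys, combining sensor streams reduces to an inner join on (Timestamp, latitude, longitude). A minimal pandas sketch of the operation performed by the R inner_join calls above, using hypothetical mini-frames:

```python
import pandas as pd

# Hypothetical stand-ins for two sensor streams; the real data shares the
# same three join keys: Timestamp, geo.lat and geo.long.
temp = pd.DataFrame({"Timestamp": ["t1", "t2"], "lat": [39.95, 39.96],
                     "lon": [-75.17, -75.18], "Temperature": [61.0, 62.5]})
pm = pd.DataFrame({"Timestamp": ["t1", "t2"], "lat": [39.95, 39.96],
                   "lon": [-75.17, -75.18], "PM2.5": [4.2, 4.6]})

# Inner join keeps only readings observed by both sensors at the same
# time and place.
merged = pd.merge(temp, pm, on=["Timestamp", "lat", "lon"], how="inner")
print(merged.shape)  # (2, 5)
```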

Here is a brief quantitative summary of the sensor data:

data.plot <- data %>% select(Temperature, Sound, Humidity, PM2.5)
summary(data.plot)
##   Temperature        Sound          Humidity         PM2.5       
##  Min.   :59.00   Min.   :34.51   Min.   :28.00   Min.   : 0.820  
##  1st Qu.:60.00   1st Qu.:40.74   1st Qu.:49.00   1st Qu.: 3.330  
##  Median :60.00   Median :41.58   Median :52.00   Median : 4.580  
##  Mean   :62.83   Mean   :49.43   Mean   :50.31   Mean   : 5.317  
##  3rd Qu.:66.00   3rd Qu.:54.08   3rd Qu.:54.00   3rd Qu.: 5.960  
##  Max.   :80.00   Max.   :94.81   Max.   :68.00   Max.   :75.480

Now equipped with a clean dataset, we want to understand the effect of our features (Temperature, Humidity, Sound Level, Time of Day and Geographic Location) on the variable of interest (response variable), PM2.5. To do so, we resort to a graphical analysis that includes histograms, correlation plots, parallel plots and geo-spatial maps.

Results

counties <- map_data("county")
penn_county <- counties %>% filter(region=="pennsylvania")
phil <- penn_county %>% filter(subregion == "philadelphia")
phil <- phil %>% rename(lon = long)

ll_means <- c(min(phil$lon), min(phil$lat))
map2 <- get_map(location = ll_means,  maptype = "hybrid", source = "google")


pm <- pm %>% rename(lon = geo.long, lat = geo.lat, PM2.5 = Particulate.Matter.micrograms.per.cubic.meter.)

ggmap(map2) + geom_point(data = pm, mapping =  aes(x = lon, y = lat, color = PM2.5), size = 2)+ 
  scale_colour_gradientn(colours = rev(heat.colors(7)), guide = "colourbar") +ggtitle("PM2.5 Measurements Overlaid on Pennsylvania Map")

#ll_means <- sapply(pm[1:2], mean)

ll_means <- c(-75.175,39.95)
map2 <- get_map(location = ll_means,  maptype = "roadmap", source = "google", zoom =14)

#ggmap(map2) + geom_tile(data = pm, aes(x = lon, y = lat, fill = part))+ scale_color_gradientn(colours = rainbow(7))

ggmap(map2) + geom_point(data = pm, mapping =  aes(x = lon, y = lat, color = PM2.5), size = 2)+ 
  scale_colour_gradientn(colours = rev(heat.colors(7)),  guide = "colourbar", limits = c(0,30)) +ggtitle("PM2.5 Measurements Overlaid on Philadelphia Map")

First, we look at the individual distributions of the variables.

mytheme <- theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15),hjust = 0.5), 
                 legend.title = element_text(colour = "black",  face = "bold.italic", family = "Helvetica"), 
                 legend.text = element_text(face = "italic", colour="steelblue4",family = "Helvetica"), 
                  axis.title = element_text(family = "Helvetica", size = (10), colour = "black"),
                  axis.text = element_text(family = "Courier", colour = "black", size = (10)))
  
#ggplot(data.plot)+geom_histogram(aes(x = Temperature), bins = 10)+mytheme+ggtitle("Histogram of Temperature Measurements")

temp <- temp %>% rename(Temperature = Temperature.degrees.Fahrenheit., lon = geo.long, lat = geo.lat)
ggplot(temp)+geom_histogram(aes(x = Temperature), bins = 10)+mytheme+ggtitle("Histogram of Temperature Measurements")

hum <- hum %>% rename(Humidity = Humidity.percent., lon = geo.long, lat = geo.lat)

ggplot(hum)+geom_histogram(aes(x = Humidity), bins = 10)+mytheme+ggtitle("Histogram of Humidity Measurements")

sound.hist <- sound[!duplicated(sound$Timestamp),]
sound.hist <- sound.hist %>% rename(Sound = Sound.Level.decibels., lon = geo.long, lat = geo.lat)
ggplot(sound.hist)+geom_histogram(aes(x = Sound), bins = 15)+mytheme+ggtitle("Histogram of Sound Measurements")

ggplot(pm)+geom_histogram(aes(x = PM2.5), bins = 30)+mytheme+ggtitle("Histogram of PM2.5 Measurements")

Next, we look at the relationships between PM2.5 and the other variables

ggplot(data.plot)+geom_point(aes(x = Temperature, y = PM2.5))+mytheme+ggtitle("Dependence of PM2.5 on Temperature Measurements")

ggplot(data.plot)+geom_point(aes(x = Humidity, y = PM2.5))+mytheme+ggtitle("Dependence of PM2.5 on Humidity Measurements")

ggplot(data.plot)+geom_point(aes(x = Temperature, y = PM2.5))

In order to summarise the relationships between all the variables we develop correlation plots and parallel plots.

The correlation plot is shown below:

# pick the numeric columns
data.comp.numeric <- data.plot %>% select_if(is.numeric)
# correlation table
corr.table <- melt(cor(data.comp.numeric)) %>% mutate(value = abs(value))
# reorder the columns by the abs corr with PM2.5
corr.table.pm2.5 <- corr.table %>% filter(Var2 == "PM2.5")
col.order <- order(corr.table.pm2.5$value)
data.comp.numeric.2 <- data.comp.numeric[, col.order]

# ordered correlation table
corr.table <- melt(cor(data.comp.numeric.2)) %>% mutate(value = abs(value))

ggplot(corr.table, aes(x=Var1, y=Var2)) + 
  geom_tile(aes(fill=value)) +
  scale_x_discrete(limits = rev(levels(corr.table$Var1))) +
  scale_fill_gradient( low = "#56B1F7", high = "#132B43") +     #lightblue to darkblue
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

The plot doesn’t distinguish between positive and negative correlations and is used simply to assess the strength of the correlation between variables. It reveals that Temperature and Humidity have a fair correlation with PM2.5 levels.
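The ordering step used above, ranking variables by the absolute value of their correlation with PM2.5, can be expressed compactly in pandas as well (a sketch on synthetic data, not the project dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for data.plot: four numeric columns.
df = pd.DataFrame(rng.normal(size=(50, 4)),
                  columns=["Temperature", "Sound", "Humidity", "PM2.5"])

# Absolute correlation of every variable with PM2.5, weakest first.
abs_corr = df.corr()["PM2.5"].abs()
order = abs_corr.sort_values().index.tolist()
print(order[-1])  # PM2.5 (perfectly correlated with itself)
```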

The relationship between the variables is further explored using parallel plots.

temp.min <- min(data.plot$Temperature)
temp.max <- max(data.plot$Temperature)


temp.range <- temp.max-temp.min


hum.min <- min(data.plot$Humidity)
hum.max <- max(data.plot$Humidity)
hum.range <- hum.max-hum.min


pm.min <- min(data.plot$PM2.5)
pm.max <- max(data.plot$PM2.5)
pm.range <- pm.max-pm.min


s.min <- min(data.plot$Sound)
s.max <- max(data.plot$Sound)
s.range <- s.max-s.min




data.cat <-  data.plot %>% mutate(Temperature = ifelse(Temperature < (temp.min+temp.range/3), "Low",ifelse(Temperature > (temp.min+2*temp.range/3),"High","Medium"))) %>% mutate(Humidity = ifelse(Humidity < (hum.min+hum.range/3), "Low",ifelse(Humidity > (hum.min+2*hum.range/3),"High","Medium"))) %>% mutate(PM2.5 = ifelse(PM2.5 < 10, "Low",ifelse(PM2.5 > 30,"High","Medium")))%>% mutate(Sound = ifelse(Sound < (s.min+s.range/3), "Low",ifelse(Sound > (s.min+2*s.range/3),"High","Medium")))%>% mutate(Temperature = as.factor(Temperature), Humidity = as.factor(Humidity), PM2.5 = as.factor(PM2.5), Sound = as.factor(Sound))


data.temp <- data.cat %>% group_by(Temperature, Humidity, Sound, PM2.5) %>% summarise(Freq = n()) 

#Parallel plot
parallelset <- function(..., freq, col="gray", border=0, layer, 
                             alpha=0.5, gap.width=0.05) {
  p <- data.frame(..., freq, col, border, alpha, stringsAsFactors=FALSE)
  n <- nrow(p)
  if(missing(layer)) { layer <- 1:n }
  p$layer <- layer
  np <- ncol(p) - 5
  d <- p[ , 1:np, drop=FALSE]
  p <- p[ , -c(1:np), drop=FALSE]
  p$freq <- with(p, freq/sum(freq))
  col <- col2rgb(p$col, alpha=TRUE)
  if(!identical(alpha, FALSE)) { col["alpha", ] <- p$alpha*256 }
  p$col <- apply(col, 2, function(x) do.call(rgb, c(as.list(x), maxColorValue = 256)))
  getp <- function(i, d, f, w=gap.width) {
    a <- c(i, (1:ncol(d))[-i])
    o <- do.call(order, d[a])
    x <- c(0, cumsum(f[o])) * (1-w)
    x <- cbind(x[-length(x)], x[-1])
    gap <- cumsum( c(0L, diff(as.numeric(d[o,i])) != 0) )
    gap <- gap / max(gap) * w
    (x + gap)[order(o),]
  }
  dd <- lapply(seq_along(d), getp, d=d, f=p$freq)
  par(mar = c(0, 0, 2, 0) + 0.1, xpd=TRUE )
  plot(NULL, type="n",xlim=c(0, 1), ylim=c(np, 1),
       xaxt="n", yaxt="n", xaxs="i", yaxs="i", xlab='', ylab='', frame=FALSE)
  for(i in rev(order(p$layer)) ) {
     for(j in 1:(np-1) )
     polygon(c(dd[[j]][i,], rev(dd[[j+1]][i,])), c(j, j, j+1, j+1),
             col=p$col[i], border=p$border[i])
   }
   text(0, seq_along(dd), labels=names(d), adj=c(0,-2), font=2)
   for(j in seq_along(dd)) {
     ax <- lapply(split(dd[[j]], d[,j]), range)
     for(k in seq_along(ax)) {
       lines(ax[[k]], c(j, j))
       text(ax[[k]][1], j, labels=names(ax)[k], adj=c(0, -0.25))
     }
   }           
}


myt <- subset(data.temp, select=c("Temperature","Humidity","Sound","PM2.5","Freq"))
myt <- within(myt, {
    color <- ifelse(PM2.5=="Low","#008888",ifelse(PM2.5=="Medium","#330066", "#000080"))
})


with(myt, parallelset(Temperature, Humidity, Sound,  PM2.5, freq=Freq, col=color, alpha=0.2))

The definitions of High, Low and Medium are summarised below:

fmt <- function(lo, hi) c(paste(">", round(hi, 2)),
                          paste("<", round(lo, 2)),
                          paste(">=", round(lo, 2), "and <=", round(hi, 2)))

cat.sum <- data.frame(Temperature = fmt(temp.min + temp.range/3, temp.min + 2*temp.range/3),
                      Humidity = fmt(hum.min + hum.range/3, hum.min + 2*hum.range/3),
                      PM2.5 = fmt(10, 30),
                      Sound = fmt(s.min + s.range/3, s.min + 2*s.range/3))

rownames(cat.sum) <- c("High", "Low", "Medium")

cat.sum

Each of the features, which were originally numerical, is converted into a categorical variable with three levels: High, Medium and Low. This is done by computing the range of each variable and dividing it into three equal parts, labelling the interval containing the highest numerical values High and the one containing the lowest values Low. (Note that this is not the final way to categorise the data; it was the simplest method and was implemented for the sake of testing the plots.)
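The equal-width, three-way split described above is essentially what pandas' cut does with bins=3; a small example on hypothetical humidity readings (the same rule applies to every variable):

```python
import pandas as pd

# Hypothetical humidity readings spanning the observed range.
hum = pd.Series([28, 35, 41, 49, 52, 54, 60, 68])

# Divide the observed range into three equal-width intervals and label them.
cats = pd.cut(hum, bins=3, labels=["Low", "Medium", "High"])
print(list(cats))  # three Low, three Medium, two High
```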

The above plot shows that around University City, the PM2.5 levels were mostly below 10. During the small fraction of the time that PM2.5 levels were between 10 and 30, humidity was above 41%, sound levels were between 54.61 and 74.71 dB, and temperatures were above 73 degrees Fahrenheit.

To Do:

  1. Think of interesting ways to implement ML techniques.
  2. Add details to the write-up.